Segmentation

University of San Francisco

Matt Meister

Segmentation in Marketing

  • Objectives
  • Steps
    • Data selection
    • Distance measures
    • Clustering algorithms
    • Selecting the number of clusters
    • Profiling of clusters

Segmentation in Marketing

  • Market segmentation is one of the most fundamental marketing concepts:
    • Grouping people (with the willingness and ability to buy) according to their similarity on dimensions related to product(s) under consideration
  • Better-chosen segments -> greater marketing success
    • Customize products and distribution strategies for target segments
      • Higher customer satisfaction, retention
    • Customize promotions for target segments
      • Higher customer acquisition, retention, up-selling
    • Customize prices for target segments
      • Extract as much $ from targeted customers as possible

Segmentation: Marketer’s dilemma

We’re making a trade-off as marketers between:

  • Market segmentation (Heterogeneous market)
  • Market aggregation (Homogeneous market)

Segmentation: Marketer’s dilemma

  • Market segmentation (Heterogeneous market)
    • Higher revenue
    • Individually-customized products
      • High production/admin costs
      • Appeals to many
      • Priced near consumer WTP
  • Market aggregation (Homogeneous market)
    • Lower costs
    • Single product
      • Low production/admin costs
      • Does not appeal to some
      • Priced too low for some
  • Forming a few market segments strikes balance between extremes

Segmentation

Criteria

  • Substantial
    • Measurable market size
    • Large enough to warrant serving
  • Actionable
    • Segment characteristics can be translated into targeted marketing policies
      • e.g., age/income differences suggest different promotional vehicles
    • Targeted policies must be consistent with firm abilities
  • Differentiable
    • Differences between segments should be clearly defined
    • Segment-specific marketing policies can be implemented without overlap

Segmentation

Approaches

  • Manual
    • Managerial experience, industry norms used to determine segments
    • Prone to bias; hard to validate against data
  • Automatic
    • Form segments using data-driven techniques
  • Hybrid
    • Combination of manual & automatic (often best)
  • Cluster analysis is the most commonly used data-driven technique for segmenting customer data

Cluster analysis

Cluster analysis

  • What?
    • Grouping of objects (e.g. customers) by similarity of attributes (variables)
    • Objects in a group similar, distinct from objects in other groups
    • Natural relationship to market segmentation
  • How?
    • Calculates pairwise similarity measures
      • Some measure of “distance” between objects (customers)
      • Many possible distance metrics (Euclidean, Gower, etc.)
    • Searches for “best” groupings using a clustering method (algorithm)
      • Many possible clustering methods (k-means, hierarchical, etc.)

Cluster analysis

Steps

  1. Select variables to use for clustering
  2. Define distance measure between individuals (Euclidean, Gower, etc.)
  3. Select clustering procedure (k-means, hierarchical, etc.)
  4. Select number of clusters
  5. Interpret and profile the clusters

Step 1: Select variables

Data sources

  • Past behavior & derived metrics (firm’s CRM/transactions database)
    • Past expenditure levels
    • Recency & frequency of purchase
    • Lifetime value (CLV)
  • Preference measures (CRM, survey/experimental research)
    • Conduct individual demand analyses, recover individual-specific parameters
    • Use individual-specific parameters as input to cluster analysis
  • Demographic variables (CRM database, Census records)
    • Directly observed — often limited, but sometimes directly observe gender, age, etc.
    • Imputed from geography — e.g., use knowledge of home ZIP to match Census demographics for region
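
As a sketch of the geographic imputation idea: match each customer's home ZIP to region-level figures with a left join. The data frames and column names below are made up purely for illustration.

```r
# Hypothetical example: impute region demographics from home ZIP.
# 'customers', 'census', and all columns/values are fabricated illustrations.
customers <- data.frame(iid = 1:4,
                        zip = c("94117", "94117", "10001", "60614"))
census <- data.frame(zip = c("94117", "10001", "60614"),
                     med_hh_inc  = c(112, 75, 89),      # $000s (made up)
                     pct_college = c(0.71, 0.55, 0.62)) # (made up)

# Left join: every customer keeps a row, even if their ZIP has no match
customers <- merge(customers, census, by = "zip", all.x = TRUE)
```

Each customer now carries the (imputed) demographics of their region, ready to be used as clustering variables.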

Step 1: Select variables

DF <- read.csv('apparel_customer_data.csv')
summary(DF)
      iid         spend_online      spend_retail          age       
 Min.   :   14   Min.   :   0.00   Min.   :   0.00   Min.   :18.00  
 1st Qu.: 2946   1st Qu.:   0.00   1st Qu.:   0.00   1st Qu.:33.00  
 Median : 5430   Median :  14.97   Median :  27.71   Median :41.00  
 Mean   : 5463   Mean   :  72.44   Mean   :  78.00   Mean   :40.91  
 3rd Qu.: 8110   3rd Qu.:  70.72   3rd Qu.:  78.00   3rd Qu.:49.00  
 Max.   :10589   Max.   :1985.75   Max.   :2421.91   Max.   :88.00  
     white           college            male           hh_inc       
 Min.   :0.0000   Min.   :0.0000   Min.   :0.000   Min.   :  2.499  
 1st Qu.:0.7297   1st Qu.:0.3835   1st Qu.:0.000   1st Qu.: 59.356  
 Median :0.8550   Median :0.5580   Median :0.000   Median : 87.364  
 Mean   :0.7993   Mean   :0.5437   Mean   :0.091   Mean   : 96.254  
 3rd Qu.:0.9422   3rd Qu.:0.7136   3rd Qu.:0.000   3rd Qu.:122.602  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.000   Max.   :250.001  

Step 1: Select variables

Data pre-processing

  • Clustering algorithms typically perform better on normally distributed data
  • Often, we “normalize” data, using two techniques:
    • Log-transform of highly skewed variables — to reduce skew & outlier influence
    • Standardization of variables — de-mean and rescale variance to 1
  • Best practice is to generate histograms of potential cluster variables
    • Infer from distribution plots which variables to log-transform

Step 1: Select variables

Data pre-processing

DF$log_spend_online <- log(1+DF$spend_online)

Step 1: Select variables

Data pre-processing

DF$log_spend_retail <- log(1+DF$spend_retail)

Step 2: Define distance between individuals

  • We measure similarity between two customers by calculating the “distance” between them
    • Distance can be measured in different ways, depending on data type
  • Two most general-purpose distance metrics
    • Euclidean — if all data is continuous, and not multi-modal
    • Gower — can be used with mixed (discrete data/dummy variables and continuous) data; also can work better with multi-modal data

Step 2: Define distance between individuals

  • Euclidean distance (continuous data)
    • Usually applied to standardized data to give the same weight to all the variables
    • Essentially the straight-line distance between customers \(i\) and \(j\) (over \(K\) variables):
      • Calculate difference between \(i\) and \(j\) on each individual variable
      • Square all of those (to make them positive)
      • Add those up
      • Take square root of that
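
The verbal recipe above is exactly the standard Euclidean distance formula, where \(x_{ik}\) is customer \(i\)'s (standardized) value on variable \(k\):

\[
d(i, j) = \sqrt{\sum_{k=1}^{K} \left( x_{ik} - x_{jk} \right)^2}
\]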

Step 2: Define distance between individuals

Is 199 more similar to 1163 or 9594?

What is the Euclidean distance between customers 199 and 1163 on spend_online, hh_inc?

Standardize first

DF$spend_online_standardized <- scale(DF$spend_online)[,1]
DF$hh_inc_standardized <- scale(DF$hh_inc)[,1]
spend_online_dif <- ( DF[ DF$iid == 199, ]$spend_online_standardized - DF[ DF$iid == 1163, ]$spend_online_standardized) ^ 2
hh_inc_dif <- ( DF[ DF$iid == 199, ]$hh_inc_standardized - DF[ DF$iid == 1163, ]$hh_inc_standardized) ^ 2

ed_199_1163 <- sqrt( spend_online_dif + hh_inc_dif)
ed_199_1163
[1] 0.3264905

Step 2: Define distance between individuals

What is the Euclidean distance between customers 199 and 9594 on spend_online, hh_inc?

spend_online_dif <- ( DF[ DF$iid == 199, ]$spend_online_standardized - DF[ DF$iid == 9594, ]$spend_online_standardized) ^ 2
hh_inc_dif <- ( DF[ DF$iid == 199, ]$hh_inc_standardized - DF[ DF$iid == 9594, ]$hh_inc_standardized) ^ 2

ed_199_9594 <- sqrt( spend_online_dif + hh_inc_dif)
ed_199_9594
[1] 1.273762

Step 2: Define distance between individuals

  • Gower distance (mixed continuous & dummy variables)
    • Also sometimes useful for continuous data with more than 1 mode
    • Standardization is effectively “built-in” to the distance formula
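
In its simplest form (equal weights, no missing data), the Gower distance averages a per-variable dissimilarity, each of which is bounded between 0 and 1:

\[
d(i, j) = \frac{1}{K} \sum_{k=1}^{K} d_{ijk}, \qquad
d_{ijk} =
\begin{cases}
\dfrac{\left| x_{ik} - x_{jk} \right|}{R_k} & \text{continuous, } R_k = \text{range of variable } k \\[1.5ex]
\mathbb{1}\{ x_{ik} \neq x_{jk} \} & \text{dummy}
\end{cases}
\]

Dividing each continuous difference by that variable's range \(R_k\) is why standardization is effectively built in.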

Step 2: Define distance measure between individuals

R implementation of Euclidean & Gower distance

daisy() function in cluster package

  • Euclidean:
    • daisy(DF, metric = "euclidean", warnType = FALSE, stand = TRUE)
      • DF: dataframe with all continuous variables
      • warnType=FALSE to silence warnings
      • stand=TRUE to standardize variables
  • Gower distance:
    • daisy(DF, metric = "gower", warnType=FALSE)
      • DF: dataframe with continuous/binary variables
      • warnType=FALSE to silence warnings

Step 2: Define distance measure between individuals

R implementation of Euclidean & Gower distance

daisy() function in cluster package

DF <- read.csv('apparel_customer_data.csv')

library(cluster)

DF_euclidean <- daisy(DF, metric = "euclidean", warnType = FALSE, stand = TRUE)

Step 2: Define distance measure between individuals

R implementation of Euclidean & Gower distance

daisy() function in cluster package

DF_gower <- daisy(DF, metric = "gower", warnType = FALSE)

Step 3: Select clustering procedure

Two main branches of clustering algorithms

  • Hierarchical
    • Iteratively build up clusters from nearest (distance-wise) pairs
    • Cluster “tree” is cut at the height that yields the desired number of clusters
  • Nonhierarchical
    • Number of segments is specified by the analyst
    • K-means minimizes total within-cluster variance (equivalently, maximizes between-cluster separation)
    • Model-based approaches fit mixture models (e.g., Gaussian mixtures)
  • Most often, with “right” distance measure, get similar results
    • We focus on k-means

Step 3: Select clustering procedure

K-means clustering algorithm

  • Each cluster is associated with a centroid (center point)
  • Each point is assigned to the cluster with the closest centroid
  • Number of clusters, \(K\), must be specified
  • The basic algorithm is very simple:
    • Randomly select \(K\) points as the initial centroids
    • Assign all other points to the cluster with the nearest centroid
    • Re-compute centroid as the average in that cluster
    • Reassign each point to the cluster with the nearest centroid
    • Re-compute centroid as the average in that cluster
    • Repeat until points don’t change

Step 3: Select clustering procedure

K-means clustering algorithm

Step 3: Select clustering procedure

K-means clustering algorithm

  • Select 3 points as the start:
centroids <- sample( 1:nrow(df), 3)

Step 3: Select clustering procedure

K-means clustering algorithm

  • Assign points to closest cluster:
df$distance_1 <- sqrt( (df$x - df[centroids[1],]$x) ^ 2 + (df$y - df[centroids[1],]$y) ^ 2)
df$distance_2 <- sqrt( (df$x - df[centroids[2],]$x) ^ 2 + (df$y - df[centroids[2],]$y) ^ 2)
df$distance_3 <- sqrt( (df$x - df[centroids[3],]$x) ^ 2 + (df$y - df[centroids[3],]$y) ^ 2)

df$cluster <- ifelse(df$distance_1 < df$distance_2 & df$distance_1 < df$distance_3, 1,
                     ifelse( df$distance_2 < df$distance_1 & df$distance_2 < df$distance_3, 2,
                             ifelse( df$distance_3 < df$distance_1 & df$distance_3 < df$distance_2, 3, NA_real_)))

Step 3: Select clustering procedure

K-means clustering algorithm

  • Calculate new centroids:
centroids <- data.frame(
  cluster = as.factor(c(1, 2, 3)),
  x = c(mean(df[ df$cluster == 1, ]$x), mean(df[ df$cluster == 2, ]$x), mean(df[ df$cluster == 3, ]$x)),
  y = c(mean(df[ df$cluster == 1, ]$y), mean(df[ df$cluster == 2, ]$y), mean(df[ df$cluster == 3, ]$y))
)

Step 3: Select clustering procedure

K-means clustering algorithm

  • Re-assign points:
df$distance_1 <- sqrt( (df$x - centroids[1,]$x) ^ 2 + (df$y - centroids[1,]$y) ^ 2)
df$distance_2 <- sqrt( (df$x - centroids[2,]$x) ^ 2 + (df$y - centroids[2,]$y) ^ 2)
df$distance_3 <- sqrt( (df$x - centroids[3,]$x) ^ 2 + (df$y - centroids[3,]$y) ^ 2)

df$cluster_2 <- ifelse(df$distance_1 < df$distance_2 & df$distance_1 < df$distance_3, 1,
                     ifelse( df$distance_2 < df$distance_1 & df$distance_2 < df$distance_3, 2,
                             ifelse( df$distance_3 < df$distance_1 & df$distance_3 < df$distance_2, 3, NA_real_)))

sum(df$cluster != df$cluster_2)
[1] 8

Step 3: Select clustering procedure

K-means clustering algorithm

  • Calculate new centroids:
centroids <- data.frame(
  cluster = as.factor(c(1, 2, 3)),
  x = c(mean(df[ df$cluster_2 == 1, ]$x), mean(df[ df$cluster_2 == 2, ]$x), mean(df[ df$cluster_2 == 3, ]$x)),
  y = c(mean(df[ df$cluster_2 == 1, ]$y), mean(df[ df$cluster_2 == 2, ]$y), mean(df[ df$cluster_2 == 3, ]$y))
)

Step 3: Select clustering procedure

K-means clustering algorithm

  • Re-assign points:
df$distance_1 <- sqrt( (df$x - centroids[1,]$x) ^ 2 + (df$y - centroids[1,]$y) ^ 2)
df$distance_2 <- sqrt( (df$x - centroids[2,]$x) ^ 2 + (df$y - centroids[2,]$y) ^ 2)
df$distance_3 <- sqrt( (df$x - centroids[3,]$x) ^ 2 + (df$y - centroids[3,]$y) ^ 2)

df$cluster_3 <- ifelse(df$distance_1 < df$distance_2 & df$distance_1 < df$distance_3, 1,
                     ifelse( df$distance_2 < df$distance_1 & df$distance_2 < df$distance_3, 2,
                             ifelse( df$distance_3 < df$distance_1 & df$distance_3 < df$distance_2, 3, NA_real_)))

sum(df$cluster_2 != df$cluster_3)
[1] 10

Step 3: Select clustering procedure

K-means clustering algorithm

  • Calculate new centroids:
centroids <- data.frame(
  cluster = as.factor(c(1, 2, 3)),
  x = c(mean(df[ df$cluster_3 == 1, ]$x), mean(df[ df$cluster_3 == 2, ]$x), mean(df[ df$cluster_3 == 3, ]$x)),
  y = c(mean(df[ df$cluster_3 == 1, ]$y), mean(df[ df$cluster_3 == 2, ]$y), mean(df[ df$cluster_3 == 3, ]$y))
)

Step 3: Select clustering procedure

K-means clustering algorithm

  • Re-assign points:
df$distance_1 <- sqrt( (df$x - centroids[1,]$x) ^ 2 + (df$y - centroids[1,]$y) ^ 2)
df$distance_2 <- sqrt( (df$x - centroids[2,]$x) ^ 2 + (df$y - centroids[2,]$y) ^ 2)
df$distance_3 <- sqrt( (df$x - centroids[3,]$x) ^ 2 + (df$y - centroids[3,]$y) ^ 2)

df$cluster_4 <- ifelse(df$distance_1 < df$distance_2 & df$distance_1 < df$distance_3, 1,
                     ifelse( df$distance_2 < df$distance_1 & df$distance_2 < df$distance_3, 2,
                             ifelse( df$distance_3 < df$distance_1 & df$distance_3 < df$distance_2, 3, NA_real_)))

sum(df$cluster_4 != df$cluster_3)
[1] 5
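
The manual iterations above can be collapsed into a loop that repeats the assign/re-compute steps until assignments stop changing. This is only a sketch on simulated 2-D data, not what we use in practice (that's stats::kmeans()); for reproducibility it seeds one starting centroid in each simulated group, whereas the algorithm proper picks \(K\) random points:

```r
set.seed(42)
# Three well-separated simulated groups in 2-D
df <- data.frame(x = c(rnorm(50, 0), rnorm(50, 5), rnorm(50, 10)),
                 y = c(rnorm(50, 0), rnorm(50, 5), rnorm(50, 0)))
K <- 3

# Initial centroids: one point from each group (for a reproducible sketch;
# a random start would be df[sample(nrow(df), K), ] as in the slides)
centroids <- df[c(1, 51, 101), c("x", "y")]

repeat {
  # Assign each point to the cluster with the nearest centroid
  d <- sapply(1:K, function(k)
    sqrt((df$x - centroids$x[k])^2 + (df$y - centroids$y[k])^2))
  new_cluster <- apply(d, 1, which.min)

  # Stop once assignments no longer change
  if (!is.null(df$cluster) && all(new_cluster == df$cluster)) break
  df$cluster <- new_cluster

  # Re-compute each centroid as the mean of its cluster
  # (empty clusters are not handled in this sketch)
  centroids <- aggregate(cbind(x, y) ~ cluster, data = df, FUN = mean)
}
```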

K-means in R

Once we have the distance (similarity) matrix from daisy():

  • We call kmeans() to perform the cluster analysis
    • Applies to both Euclidean/Gower distance matrices
    • Use nstart option to multi-start algorithm from multiple random points
      • Recommended range: 10 to 25 (or more), depending on data regularity
  • Syntax:
    • 4 segments, using a Gower distance matrix
      • DF$clu_gower_4 <- kmeans(DF_gower, centers = 4, nstart = 10)$cluster
    • 4 segments, using a Euclidean distance matrix
      • DF$clu_euclid_4 <- kmeans(DF_euclidean, centers = 4, nstart = 10)$cluster

K-means in R

In total:

# Calculate euclidean distance
DF_euclidean <- daisy(DF, metric = "euclidean", warnType = FALSE, stand = TRUE)

# Create segments
segments <- kmeans(DF_euclidean, centers = 4, nstart = 10)

# Assign
DF$clu_euclid_4 <- segments$cluster

You can also do this in one step, but I wouldn’t recommend it:

  • Calculating the distances is slow, and nesting daisy() inside kmeans() forces you to recompute them every time you re-run the clustering
DF$clu_euclid_4 <- kmeans(
  daisy(DF, metric = "euclidean", warnType = FALSE, stand = TRUE), 
  centers = 4, nstart = 10)$cluster

Step 4: Select number of clusters

  • To perform k-means, we must specify the number of clusters
  • How to determine “how many” clusters?
    • Statistical guidance — elbow plots
    • Segmentation criteria — substantial, actionable, differentiable
  • Elbow plots
    • Graph the total within-cluster variation (the sum of squared distances from points to their cluster centroids) vs. # of clusters
    • Within-cluster variation necessarily decreases as \(k\) gets larger: with more clusters, points sit closer to their centroids
    • The elbow method chooses \(k\) at the “bend”: the point after which adding clusters yields only small further decreases

Step 4: Select number of clusters

Elbow plot example from df

What was the within-cluster variation from the example I had?

The kmeans() output has a $withinss result, similar to how it has a $cluster result

round( sum(kmeans(df_euclidean, centers = 3, nstart = 10)$withinss), 3)
[1] 6006.087

Step 4: Select number of clusters

Elbow plot example from df

# Euclidean distance
df_euclidean <- daisy(df, metric = "euclidean", warnType = FALSE, stand = TRUE)

# max number of clusters to test
max_clusters <- 10 

# list to hold within-cluster sum-of-squares
wss <- rep(0, max_clusters) 

# loop over cluster
for (i in 1:max_clusters) { 
  segments <- kmeans(df_euclidean, centers = i, nstart=10)
  wss[i] <- sum(segments$withinss) # within-cluster sum-of-squares, summed over clusters
}

as.data.frame(wss)
          wss
1  402.138498
2  149.588725
3   26.886629
4   21.013860
5   16.578999
6   14.285304
7   12.939867
8   11.182604
9   10.316321
10   9.181797

Step 4: Select number of clusters

Elbow plot example from df

elbow_data <- data.frame(k = 1:max_clusters, WCSS = wss)

ggplot(elbow_data, aes(x = k, y = WCSS)) +
  geom_line(color = "blue") +
  geom_point(color = "blue") +
  labs(title = "k-means Elbow Plot",
       x = "Number of Clusters",
       y = "Within groups sum of squares") +
  theme_minimal()
  • Here the elbow plot is almost too clean to be instructive
    • The simulated clusters are too perfectly separated
    • In practice we would also choose based on the firm’s abilities (the actionable criterion)

Step 4: Select number of clusters

Elbow plot example from DF

# Euclidean distance
DF_euclidean <- daisy(DF, metric = "euclidean", warnType = FALSE, stand = TRUE)

# max number of clusters to test
max_clusters <- 10 

# list to hold within-cluster sum-of-squares
wss <- rep(0, max_clusters) 

# loop over cluster
for (i in 1:max_clusters) { 
  segments <- kmeans(DF_euclidean, centers = i, nstart=10)
  wss[i] <- sum(segments$withinss) # within-cluster sum-of-squares, summed over clusters
}

as.data.frame(wss)
         wss
1  5441463.4
2  3182491.3
3  1862694.3
4  1550044.8
5  1310505.9
6  1232772.6
7  1194428.1
8   786449.4
9   747327.0
10  720655.1

Step 4: Select number of clusters

Elbow plot example from DF

elbow_data <- data.frame(k = 1:max_clusters, WCSS = wss)

ggplot(elbow_data, aes(x = k, y = WCSS)) +
  geom_line(color = "blue") +
  geom_point(color = "blue") +
  labs(title = "k-means Elbow Plot",
       x = "Number of Clusters",
       y = "Within groups sum of squares") +
  theme_minimal()

3 or 4 is best

Step 5: Interpret and profile the clusters

  • Main objects of interest
    • Segment sizes — % of sample
    • Cluster centroids = mean values of variables within the cluster
  • Assessment of segmentation criteria
    • Substantial — prune small segments unless they are unusually high-value
    • Actionable — Can we translate differences in clusters to targeted policies?
    • Differentiable — Sufficiently different to make/enforce different policies?

Step 5: Interpret and profile the clusters

A simple way (among many) to compute sizes:

segments <- kmeans(DF_euclidean, centers = 4, nstart=10)

DF$cluster <- segments$cluster

library(dplyr)
DF |>
  group_by(cluster) |>
  summarise(size = n(),
            proportion = round(n()/nrow(DF), 3))
# A tibble: 4 × 3
  cluster  size proportion
    <int> <int>      <dbl>
1       1   163      0.163
2       2   386      0.386
3       3    18      0.018
4       4   433      0.433

Step 5: Interpret and profile the clusters

A simple way (among many) to compute means:

DF |>
  group_by(cluster) |>
  summarise(across(c(spend_online, spend_retail, age, white, college, male, hh_inc), mean)) |>
  round(3)
# A tibble: 4 × 8
  cluster spend_online spend_retail   age white college  male hh_inc
    <dbl>        <dbl>        <dbl> <dbl> <dbl>   <dbl> <dbl>  <dbl>
1       1        137.         136.   42.8 0.711   0.524 0.552  102. 
2       2         31.1         50.0  43.3 0.878   0.695 0      126. 
3       3        777.         841.   44.2 0.838   0.548 0.056  117. 
4       4         55.7         49.2  37.9 0.761   0.416 0       67.0